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Abstract 

Background: Living cells are realized by complex gene expression programs that are moderated by regulatory 
proteins called transcription factors (TFs). The TFs control the differential expression of target genes in the context 
of transcriptional regulatory networks (TRNs), either individually or in groups. Deciphering the mechanisms of how 
the TFs control the expression of target genes is a challenging task, especially when multiple TFs collaboratively 
participate in the transcriptional regulation. 

Results: We model the underlying regulatory interactions in terms of the directions (activation or repression) and 
their logical roles (necessary and/or sufficient) with a modified association rule mining approach, called mTRIM. The 
experiment on Yeast discovered 670 regulatory interactions, in which multiple TFs express their functions on 
common target genes collaboratively. The evaluation on yeast genetic interactions, TF knockouts and a synthetic 
dataset shows that our algorithm is significantly better than the existing ones. 

Conclusions: mTRIM is a novel method to infer TF collaborations in transcriptional regulation networks. mTRIM is 
available at http://www.msu.edu/~jinchen/mTRIM. 




Background 

The complex gene expression programs in living cells are 
moderated by regulatory proteins called transcription 
factors (TFs) [1]. In the context of a transcriptional regu- 
latory network (TRN), a TF may act independently or 
collaboratively with other TFs [2], leading to complex 
regulatory interactions that influence the transcription of 
target genes [3,4]. A regulatory interaction includes target 
genes and all the TFs that control their transcriptional 
activities. An individual-TF regulatory interaction has 
been defined in terms of two properties: the TF's func- 
tional role as an activator or a repressor, and its logical 
role as being necessary or sufficient (see Figure la) [3,5]. 
The categories in the TF's functional and logical roles are 
combinable; they can be activator necessary (AN), activa- 
tor sufficient (AS), or activator necessary and sufficient 
(ANS). For example, pheromone response elements are 
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necessary and sufficient for basal and pheromone- 
induced transcription of the FUS1 gene of yeast [6]. 
Similarly for TFs that are repressors, they can be RN, RS 
or RNS [7], In a multiple-TF regulatory interaction, a 
group of TFs collaborate to control the expression levels 
of the same target genes. The directions of all the TFs in 
the group, therefore, form a transcriptional regulation 
pattern of the target genes. Recent developments in bio- 
technology (such as ChIP [8] and yeast one-hybrid [9]) 
have been applied to uncover TF-target binding relation- 
ships [10,11] to reconstruct draft regulatory circuits at a 
systems level [3,4,12]. Furthermore, to identify regulatory 
interactions in vivo and consequently reveal their functions, 
TF single/double knockouts and over-expression experi- 
ments have been systematically carried out [13]. However, 
the results of many single or double-knockout (or over- 
expression) experiments are often non-conclusive [14], 
since many genes are regulated by multiple TFs with com- 
plementary functions [4]. For example, in yeast (one of the 
most well-studied eukaryotic organisms), 47% of genes are 
bound by at least two TFs [15], and approximately 73% 
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Figure 1 (a) Concepts of regulatory interaction and (b) an illustrative example of regulatory interaction in a TRN, in which three TFs 
collaboratively co-regulate two target genes 



(~4,500) of the known genes are non-essential [16], sug- 
gesting that higher order genetic variations are needed for 
precise inference of transcriptional regulations. 

Considering the prohibitive costs and the tremendous 
number of possible combinations of higher-order gene 
knockouts, it is currently impossible for researchers to 
examine all of possible gene knockout combinations 
experimentally. One solution to this problem is to select 
only the TF groups that are most likely to bring about 
the phenotypic change. In order to accomplish this, we 
need to understand the interactions employed by multi- 
ple TFs (called regulatory interactions) to regulate their 
common target genes. However, this is a difficult task, 
because when multiple TFs simultaneously or sequen- 
tially control their target genes, a single gene responds 
to merged inputs, resulting in complex gene expression 
patterns [17,18]. The exhaustive approach requires enu- 
merating all TF combinations, which, given the high 
complexity of combinatorial, is simply impractical at the 
whole genome level. 

In our previous research [19], a Hidden Markov model 
was developed to relate gene expression patterns to regula- 
tory interactions, in order to solve a relatively simpler sub- 
problem that considers only two TFs. To predict regulatory 
interactions for all possible collaborative TFs, we propose 
an algorithm called "mTRIM" (multiple Transcriptional 
Regulatory Interaction Mechanism) in this paper. By unco- 
vering the regulatory interactions in terms of their direc- 
tions (activation or repression) and corresponding logical 
roles (necessary and/or sufficient) from gene expression 
and TF-DNA binding data, mTRIM identifies TF groups 



that are collaboratively responsible for target gene expres- 
sions. Such inferences may provide high-quality candidate 
sets for further experimentally detecting the collaborative 
functions of gene regulations that are largely unknown 
[18]. Yeang and Jaakkola [3] attempted to characterize the 
combinatorial regulation of multiple-TF regulatory interac- 
tions using a heuristic approach to measure how well a 
regulatory module fits the associated binding and gene 
expression data with a log-likelihood function. The regula- 
tory module's likelihood is maximized with a greedy 
approach by incrementally adding genes to the module and 
monitoring the predictions of the regulatory interactions 
for optimality. However, this incremental approach does 
not study the functions of the TFs simultaneously because 
of the scalability issue introduced by the greedy search. 
This method also uses a probability-based approach to cal- 
culate the significance of the combinatorial property of 
TFs, determined by the gap of likelihood scores between 
their model and a model built on randomized data in the 
entire time frame. However, as stated in [4], a TF usually 
functions at specific "activation time points" instead of 
throughout the entire time course, meaning that the identi- 
fication of regulatory interaction modules should be 
focused on activation time-points rather than the entire 
time frame. 

To derive dynamic regulatory networks that associate 
TFs with target genes at their activation time-points, an 
algorithm called DREM was proposed [4]. DREM inte- 
grates time-series gene expression data and protein-DNA 
binding data to build a global temporal map, in order to 
uncover transcriptional regulatory events leading to the 
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observed temporal expression patterns and the underly- 
ing factors that control these events during a cell's 
response to stimuli. The method mainly works by identi- 
fying bifurcation time-points where the expression of a 
subset of genes diverges from the rest of the genes. The 
bifurcation points are then annotated with the TFs regu- 
lating these transitions, which result in a unified temporal 
map. The method can therefore facilitate the determina- 
tion of the time when TFs are exerting their influence, 
and assigns genes to paths in the map based on their 
expression profiles and the TFs that control them. Unlike 
the method by Yeang and Jaakkola [3], DREM's ability to 
derive dynamic maps that associate TFs with the genes 
they regulate and their activation time-points has indeed 
led to better insights for the regulatory module being stu- 
died. However, DREM does not infer the logical roles of 
the TFs (i.e., whether a specific TF is necessary or suffi- 
cient for regulating a set of target genes). Such knowl- 
edge is extremely useful for designing high-order genetic 
variation experiments to understand the complex regula- 
tory mechanisms of biological processes. 

TRIM is an HMM based model which was developed to 
infer the collaboration of at most two TFs that regulate 
the same target genes. In the HMM, the functions of a TF 
are hidden states. The model starts with random priors, 
and then is iteratively trained using EM till convergence. 
Since each possible function of a TF is a node in the 
HMM, there are four nodes (AS, AN, RS, and RN) for 
each TF. With the design of HMM (and the limited train- 
ing data), the number of TFs TRIM can handle is limited. 

The enumeration of all TF combinations is clearly a NP 
problem. Therefore, we focused on the most important 
biological problem (i.e., 2-TF combination) and therefore 
"hardcoded the problem in TRIM. In this paper, we solve 
the efficiency problem by developing an association 
rule mining algorithm which is capable to handle a large 
amount of data with high-level combinations. 

In this paper, we propose a new model mTRIM for 
inferring regulatory interactions for multiple TFs with an 
EM-based Bayesian inference approach [20,21] and a 
modified bottom-up association rule mining method. 
Experimental results evaluated with yeast genetic interac- 
tions, TF knockouts and a synthetic dataset shows that 
our algorithm is significantly better than the existing 
ones. 

Methods 

mTRIM is developed to efficiently infer regulatory inter- 
actions for all possible collaborative TFs in a TRN. The 
feasibility is achieved in two steps. First, an EM-based 
Bayesian inference approach is developed to identify all 
the significant individual TF regulatory interactions, 
meaning that individual TFs that can regulate the target 
genes independent to the existence of other TFs. For the 



TFs which require collaborations with other TFs to drive 
the target genes, or are actually non-deterministic (mean- 
ing lack of clear evidence of regulation), their p-values 
are insignificant. They are considered as the inputs of the 
second step. 

Second, in order to identify the collaboration of k TFs 
(k > 2), i.e., k-T¥ regulatory interaction, a bottom-up 
association rule mining approach is developed. While 
the significant TF groups are reported to the users, the 
insignificant ones are joined with each other to mine 
(k + 1)-TF regulatory interactions. It should be noted 
that unlike the conventional association rule mining 
which seeks the longest possible patterns, mTRIM out- 
puts the shortest significant results, in that the goal of 
mTRIM is to discover the smallest group of TFs that can 
regulate the target genes, so that biological experiments 
with high-order genetic variations can be subsequently 
carried out for the understanding of the behavior of 
TRNs. In terms of time complexity, consider a candidate 
A:-TF regulatory interaction J =< h^, h t j k >=^ h g . The 
algorithm computes AfnScore and p-values of all of the 
subsets, / - {tfj} (V; = 1, 2, k). If one of them is signifi- 
cant, / is immediately pruned. Hence the time complexity 
is O(k) for each candidate /c-TFs regulatory interaction. 
Every merging operation requires at most k - 1 equality 
comparisons. In the best-case scenario, it produces a 
viable candidate Ar-TF interaction. In the worst case, the 
algorithm merges every pair of infrequent (k - 1)-TF 
candidates. Therefore, the overall cost of merging candi- 
dates is between £™ (fe - 2)|P fe | and £™ (fc - 2)|P fe _i | 2 , 
where i\ is the candidate set of Ar-TF regulatory interac- 
tions. To improve the algorithm efficiency, a hash tree is 
constructed for the storage and quick access to all of the 
candidates. Because the maximum depth of the hash 
tree is k, the cost for populating the hash tree of candi- 
dates is OQ^LJ fe|Pfe|)- During candidate pruning, it is 
required to verify whether the k - 1 subsets of every 
candidate /c-TF regulatory interactions are significant. 
Since the cost for looking up an item in a hash tree is 
O(k), the time complex of candidate pruning step 
is 0{J2l 2 k{k-2)\P k \). 

Concepts 

A TRN can be represented as a directed graph in which 
each node is a TF or a gene, and each edge pointing 
from a TF to a gene represents a regulation relationship 
between them. In many organisms, in-depth transcrip- 
tome analysis has revealed the modular architecture of 
gene expression [22]. A regulatory module is a self-con- 
sistent regulatory unit R(TF, G, I) representing a set of 
co-expressed genes G = {gi, g 2 , g„} regulated in con- 
cert by a group of TFs in TF = {tf x , tf 2 , tf m } that gov- 
ern the target genes' behaviors via regulatory interaction 
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/ [5]. An example of the regulatory module is shown in 
Figure lb. 

A regulatory interaction I =< V t ' Vc •••• V™ >= ^ \ 
(which is the final output of mTRIM) is defined as a set of 
TFs {tf u tf m } co-regulating a set of genes {g lt g n }, 
where h t j i is the behavior of TF i; h g is the behavior of all 
the target genes in R, and h x G {], \, -}, meaning up- 
express, down-express and no change respectively. For 
example, if tf x \ and tf 2 i always cause the target genes 
gi and g 2 to be up-regulated, the regulatory interaction is 
<tfi |, tf 2 l> => g |. For individual regulatory interactions, 
/ e {AN, AS, RN, RS, ANS, RNS}. In this work, we assume 
that a regulatory interaction is consistent in the context of 
transcriptional control as long as the experimental condi- 
tions are unchanged. Note that binaries gene expression 
values are used in mTRIM, since TF activity is not always 
proportional to its mRNA abundance [23]. 

mTRIM Step 1. Inferring individual regulatory interactions 

To solve a relatively easier problem of inferring the reg- 
ulatory interactions for each individual TF and to pre- 
pare input for multi-TF regulatory interaction inference, 
an EM-based Bayesian inference algorithm has been 
developed [20,21]. 

To define the probabilities in Eq. 2 and Eq. 3, we fol- 
lowed the definitions in [20]. Eq 2 represents the prior 
probability of the interaction model I m , and Eq 3 repre- 
sents the probability of gene expression correlation 
between TFs and targets given the interaction model I m . 
In the Bayesian model, the training dataset is a matrix 
that contains gene expression levels of TFs and their 
targets, from which T(I m ) is estimated using Eq 4. And 
then, the likelihood is calculated using Eq 3. The prior 
probabilities are randomly assigned initially. In each 
iteration, the posterior probabilities and the frequency 
of l m are updated. The iteration will continue till the 
posterior probabilities converge. 

Let Pos be the posterior probability of a TF tf m to 
have a specific regulatory interaction I m in regulatory 
module R k , where l m e {AN, AS, RN, RS} (ANS and 
RNS will be discussed later). To infer Pos, both the 
prior probabilities Pri and the likelihood Lk of the same 
TF need to be computed, given that: 



Pos{tf m , R k , I,„) = Pri(I,„) x Lk{tf,„, R k , I m ) 



(1) 



where Pri(I m ) is the prior probability of regulatory 
interaction l m (defined in Eq 2) and the likelihood Lk 
{tf m , Rk, I m ) is defined in Eq 3. 

The prior probability Pri{I m ) captures how likely a 
given interaction I m exists given the background of all of 
the other TFs: 



Pri(I m ) 



fre{I m ) 
\R\ x \TF\ 



(2) 



where fre(I m ) is the frequency of regulatory interaction 
I m in all of the regulatory modules, is the number of 
the regulatory modules, | TF\ is the number of TFs, and 
I m e {AS, RS, AN, RN}. 

Given the definition of a regulatory interaction, the 
likelihood Lk{tf m , Rk, I m ) indicates how likely tf m in R k 
has regulatory interaction I m , which is defined by the 
expression level changes of the TF and its targets: 



Lk{tf mi R k , I m ) 



v^l-RI y^ |tf| rfJ \ 

Z^r=l 2-,m=\ 2-,t=\ 2^n=l 1 V 1 ™) 



T-l ■ 



,|G| 



(3) 



where T is the number of time-points in the training 
data, |G| is the number of genes in regulatory module 
Rk, and T(I m ) is defined as: 



r(i m ) 



1 if I m = AS and (r/ m t and g f), or 
if I m = RS and (t/ m f and g i-), or 
if I m = AN and (r/ m 1 and g i-), or 
if I m = RN and (tf m 4- and g f) 

0 otherwise 



(4) 



An expectation-maximization (EM) algorithm is 
adopted to maximize the posterior probabilities Pos(tf m , 
Rh Im)- The EM model is initialized with each TF 
assigned a random regulatory interaction. In the expec- 
tation step, we compute the likelihood of each TF to be 
a specific interaction using Eq 3. Consequently, the pos- 
terior probabilities of interactions for every TF is 
updated with Eq 1. As a result, each TF is assigned with 
the regulatory interaction with the highest posterior 
probability. In the maximization step, we maximize the 
scoring function S(R k ) = £™ T(/ m ) for each reg- 

ulatory module R k , which measures how the interaction 
of each TF in R k matches the target gene expression 
changes. Note that in the iteration the priors are 
updated but the likelihoods are constant. 

Finally, in order to determine whether I m is "necessary 
and sufficient" (ANS and RNS) or "no decision", the fol- 
lowing strategy is adopted: if none of the posterior prob- 
abilities are significant, the output is "no decision"; if the 
probabilities of both N and S states are significant, and 
there is no significant difference between them, the out- 
put is ANS or RNS depending on the target gene expres- 
sion direction; otherwise the output is the regulatory 
interaction with the highest posterior probability. 

An illustrative example is shown in Figure lb, in which 
tfi, tf 2 and if 3 regulate target genes g\ and g 2 , and they all 
belong to the same regulatory module Rk- With the gene 
expression changes in Table 1, we start with equal prior 
probabilities, i.e., Pri(AS) = Pri{RS) = Pri{AN) = Pri{RN) = 
0.25, so Lk(tf lt R k , AN) = 12/26 = 0.461, (Eq 3). After 10 
iterations, in the expectation step, Pri(AN) is updated to 
0.70 (Eq 2), hence Pos(tf u R h RS) = 0.70 x 0.461 = 0.323 
(Eq 1). In the maximization step, we have <tfi {> =>g J,, 
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Table 1 Illustrative example of time-series gene 
expression data for the genes in Figure lb. 
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because the maximum posterior probability is assigned to 
AN with p-value 0.05 (see Table 2 row 1). 

mTRIM Step 2. Mining multiple-TF regulatory interactions 

Besides the individual TF regulatory interactions, a sig- 
nificant portion of TFs collaboratively work together to 
regulate the same target genes. In order to identify these 
multiple-TF regulatory interactions, a new association 
rule mining approach has been developed. Instead of 
using the concepts of support and confidence that are 
commonly used in a conventional association rule 
mining application [24], we define an affinity scoring 
function (called AfnScore) according to the gene expres- 
sion agreement between the TF groups and their target 
genes, to meet the biological meaning of a multiple-TF 
regulatory interaction (see Section Background). Mathe- 
matically, AfnScore of each candidate regulatory interac- 



tion I =< htfj, htf 2 , 
AfnScore(I) 



h„ is calculated with: 



P{h tfl ,h tf2 ,...,h tfm ,h g )*P{h i ) 
P{h tfl ,h tf2 ,. . .,h tfm ) 



(5) 



where P(x) is the number of times that x appears in the 
given time series gene expression dataset divided by the 
product of the total number of time points and the total 
number of target genes. The p-value of each candidate 
regulatory interaction is computed by considering the dis- 
tribution of AfnScore for the regulatory interactions with 
the same number of TFs. Only the candidate interactions 
with p-values smaller than 0.05 are reported to the user. 
Specifically, if all the TFs in / are up-regulated, the TFs are 
"sufficient"; if they are all down-regulated, the TFs 
are "necessary"; otherwise, each TF acts differently to drive 
the target genes to the same direction. 



Table 2 Illustrative example of regulatory interaction 
identification on the TRN in Figure lb. 
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To identify all the significant /c-TF regulatory interac- 
tions, the new association rule mining algorithm starts 
with an empty set Q k and all the insignificant (k - 1)-TF 
interactions saved in P k , 1 (see pseudocode in Figure 2 
line 1). For interactions h =<h t f 1 , h t f k l >=^ h & 
and h =< h' t ^, h' t ^ >=$ h' s in P k , lt we combine 
them and compose a new interaction 7 12 (line 3), if I± 
and I 2 are combinable. We define that Ii and I 2 are 
combinable if and only if they satisfy the conditions that 
h = K' h * = Kfi (for i = 1, 2, .., k - 2) and f h! tfk l . 
If none of the (k - 1)-TF subsets of I i2 is significant (line 
2-8), I12 is added to candidate set C and its AfnScore is 
computed. Finally, we compute p-values for all of the £-TF 
candidates in C using t-test, report all of the significant 
regulatory interactions to the user, and save all the insig- 
nificant ones Pk to for the identification of the (k + 1)-TF 
regulatory interactions (line 9-17). 

For an illustrative example, there are 40 possible mul- 
tiple-TF regulatory interactions in the regulatory mod- 
ule shown in Figure lb. Using the time-series gene 
expression data in Table 1 all the 2-TF regulatory inter- 
action candidates are screened and their p-values are 
computed (see Table 2 row 2-4). Since none of the 2- 
TF regulatory interaction candidates is significant, a 3- 
TF interaction I 4 =<tf 1 ], tf 2 t> tfs !>=>.? T ls generated 
by merging I 2 and I 3 . The AfnScore of 7 4 is ((10/24) * 
(10/24))/(12/24)) = 0.347 and its p-value is 0.04 (see 
Table 2 row 5). Based on I 0 and 7 4 , we conclude that 
the target genes gi and g 2 are induced by the up-expres- 
sion of tfi and tf 2 and the down-expression of tf 3 , and 
the same target genes are repressed by the down- 
expression of tfi. 

Experimental results 

mTRIM was applied on two independently-constructed 
yeast transcriptional regulatory networks (the Harbison 
dataset [15] and the Reimand dataset [12]) to identify 
regulatory interactions. For performance comparison, 
DREM v3.0 [17] and TRIM [19] were both applied on 
the same datasets. We did not compare mTRIM with 
Yeang's method [3] because the latter's objective is to 
build a reliable TRN instead of predicting regulatory 
interactions. We evaluated these methods systematically 
with three independent sources: single TF knockouts 
[16] for individual regulatory interactions, genetic inter- 
actions (GI) [25] for 2-TF regulatory interactions and 
synthetic data for high-order regulatory interactions. 

Using the EM-Based Bayesian inference approach, 658 
significant individual regulatory interactions were mined 
in the Harbison dataset and 164 significant ones were 
mined in the Reimand dataset (Table 3). The results 
show that while many individual TFs drive target genes' 
behaviors, it is clear that most of them (4,414 in the 
Harbison dataset and 1,539 in the Reimand dataset) are 
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Algorithm 1 Procedure Generating Significant Patterns 

Input: Qk-l> Set of significant (k — 1)-TF regulator)' interactions 
Pk—i'- Set of insignificant (k — 1)-TF regulatory interactions 
0: p-value threshold 
Output: Qk- : Set of significant A-TF regulator)' interactions 
Pk'. Set of insignificant A-TF regulator)' interactions 
1 : Q k <- 0; Pk <- 0; candidate set C <- 0; 
2: for all /i. 7 2 6 P|.._i with 7i = <h l f l ,...h l f k _ 2 ,htf k _ i >= ! > 
h 9i . I2 =<htf l ,..,htf k 2 ,h l f/ k ] >=> /i_ 2 ,and/i 9t = /i S2 do 

3: /12 <— <^</i htf k _ 2 ,htf k _ l ,h lf > k > => 

4: if none of the (A - — l)-subset of /12 is in Qk-i then 
5: Compute A fnScore(1 12); 
6: C<-CU{/| 2 }: 
7: end if 
8: end for 
9: for all I G C do 
10: Compute p-values pvalue(I); 
11: if pvalue{I)<0 then 
12: Q fc <- Q*.. U {/}; 
13: else 

14: Pk <r- Pk U {/}; 

15: end if 

16: end for 

17: return Qk and Pk 

Figure 2 Procedure of identifying significant multiple-TF regulatory interactions. 



"no decision". It indicates that a large proportion of TFs 
need to work collaboratively with other TFs. 

Multiple-TF regulatory interactions were inferred with 
a new association mining algorithm. In total, 670 regula- 
tory interactions with multiple TFs were discovered 
(Table 4). The results show that at most 6 TFs collabora- 
tively regulate the same target genes. All the TF combi- 
nations with more than 6 TFs are either insignificant or 
have a significant subset. The whole experiments finished 
in 30 minutes on a high performance computer cluster. 

Data preparation 

Yeast ChlP-chip binding data [15] was downloaded from 
http://younglab.wi.mit.edu/regulatory_code, and a p-value 



cutoff of 0.001 was applied (the same threshold used in 
[4]) to obtain the Harbison dataset. It contains 169 TFs, 
2,864 target genes and 6,253 TF-DNA bindings. Next we 
applied the same statistical approach as in [12] to filter the 
union of the yeast ChlP-chip binding data [26] and the 
binding-site predictions [27,28] to generate the Reimand 
dataset with 2,230 TF-DNA binding relationships between 
268 TFs and 1,509 target genes. To obtain the regulatory 
modules in the TRNs, all the target genes were clustered 
based on their gene expression values with Cluster 3.0 
(specifically, k-means), which uses Pearson correlation 
coefficient for gene similarity metric [29], resulting in 50 
clusters. The clusters are then evaluated with Gene Ontol- 
ogy enrichment analysis using Bingo [30] , and unenriched 



Table 3 The number and type of the regulatory interactions for individual TFs predicted by mTRIM. 

Dataset Necessary Sufficient Necessary & Sufficient No Decision 

Activator Repressor Activator Repressor Activator Repressor 

Harbison 194 184 118 162 29 69 4414 

Reimand 22 43 42 32 7 18 1543 
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Table 4 Number of the multiple-TF regulatory 



interactions identified by mTRIM. 



Dataset 


2-TF 


3-TF 


4-TF 


5-TF 


6-TF 


Harbison 


350 


61 


82 


43 


10 


Reimand 


95 


15 


7 


7 


0 



clusters are discarded. To construct regulatory modules 
from the clustering results, the target genes that are regu- 
lated by the same TFs were partitioned if they are not in 
the same cluster. Finally, 2,172 and 1,031 regulatory mod- 
ules were obtained in the Harbison and Reimand networks 
respectively. The distribution of genes and regulatory 
modules (Figure SI and Table S2 in Additional file 1) 
reveal that many genes are bound by multiple TFs. 

To identify the individual and collaborative regulatory 
interactions in the above datasets, three widely used 
time-series microarray datasets (alpha, CDC28 and elu) 
from yeast cell cycle studies were collected [31] as train- 
ing data. These datasets contain 49 time points in total. 
In these experiments, yeast cells were first synchronized 
to the same cell cycle stage, released from synchroniza- 
tion, and then the total RNA samples were taken at 
even intervals for a period of time (Table SI in Addi- 
tional file 1). In order to decide whether a gene is signif- 
icantly up or down regulated, a gene expression change 
cutoff of 0.35 was applied (the same threshold used 
in [19]). 

To evaluate the individual regulatory relations, single- 
TF knockout microarray data were collected [16], and a 
p-value cut-off of 0.05 (as used in [16]) was applied to 
determine whether a gene is significantly affected by a TF 
knockout. To evaluate the 2-TF regulatory interactions, 
we downloaded the SGA genetic interaction dataset [25], 
which is composed of 1,711 queries crossed to 3,885 
array strains. Of the 1,711 queries, 1,377 are deletion 
mutants of non-essential genes and 334 are essential 
gene alleles. The SGA dataset contains 762,146 genetic 
interactions. Two genes are genetically interacted if 
mutations in both of them produce a phenotype that is 
significantly different to each mutation's individual 
effects. In a 2-TF regulatory interaction, if TFs collabora- 
tively regulate the same target genes, the down-regulation 
of both TFs should have a significantly different pheno- 
type as the down regulation of each individual TF. There- 
fore, such TF pairs should have a significant p-value in 
the GI dataset. To evaluate the high-order multiple-TF 
regulatory interactions, a synthetic binding network were 
built, which contains 11 TFs, 17 target genes and 58 reg- 
ulation/binding relationships. The network also contains 
two feed forward loops. Corresponding time-series gene 
expression data containing 500 time-points were ran- 
domly generated with 10% or 40% noise rate. 



Evaluation 1. Single TF knock-outs 

We used the single TF knockout microarray data to evalu- 
ate the performance of mTRIM on individual TF regula- 
tory interaction predictions in terms of the identification 
of "necessary" TFs (i.e., if the expression values of the 
target genes are significantly changed when the TF is 
knocked out). For the Harbison dataset, the prediction 
precision of mTRIM is 94.44%, higher than the results of 
TRIM (82.50%). Using the Reimand dataset, mTRIM has a 
precision of 91.94%, significantly higher than the results of 
TRIM (61.54%). DREM is not compared since it does not 
predict "necessary" TFs. 

Evaluation 2. Genetic interaction 

In a regulatory module with two TFs, if both TFs colla- 
borate to regulate the same target genes, the down- reg- 
ulation of both TFs should have significantly different 
phenotypes from the down-regulation of each individual 
TF. Therefore, such TF pairs should have a significant 
p-value in the GI dataset. To this end, for the pairs of 
TFs that are predicted by mTRIM to work collabora- 
tively, we adopted the GI dataset [25] for evaluation. 
Figure 3 (a) and 3 (b) shows the Receiver Operating 
Characteristic curve (ROC) of mTRIM, TRIM and 
DREM on Harbison dataset and Reimand dataset 
respectively. For Harbison dataset, the area under curve 
(AUC) of mTRIM is 0.81, much higher than the AUC 
of DREM (0.51) and TRIM (0.75). For Reimand dataset, 
the AUC of mTRIM is 0.80, higher than DREM (0.52) 
and TRIM (0.64). In addition, to explore whether the 
performance of mTRIM is sensitive to parameter set- 
tings, we altered its parameters systematically. For the 
Harbison dataset, Figure S2 in Additional file 1 shows 
the AUC values with different gene expression cutoffs, 
GI cutoffs, and p-value cutoffs of AfnScore respectively. 
Similarly, for Reimand dataset, Figure S2 in Additional 
file 1 shows the varying of the AUC values using differ- 
ent thresholds. These show that our method is robust 
with the GI cutoff and p-value cutoff of AfnScore, 
although its performance gradually decreases with the 
increase of gene expression cutoffs. 

Evaluation 3. Synthetic transcriptional regulatory 
networks 

A synthetic transcriptional regulatory network was gen- 
erated to evaluate the performance of mTRIM in detect- 
ing high-order multiple-TF regulatory interactions (see 
Figure 4). The synthetic network has 28 nodes (11 TFs 
and 17 target genes) and 58 edges, in which the solid 
line represents a real transcriptional regulation and 12 
(20.69%) dotted lines represent TF-DNA bindings but 
no regulation. The dotted lines were added to the net- 
work in order to test the precision of mTRIM. For the 
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synthetic network, two time series gene expression data- 
sets with 500 time-points were generated. In order to 
test the robustness of mTRIM, we repeated the simula- 
tion test twice with different rates of noises added to 
the simulated gene expression data sets. 

( ^ 



tf8 g!6 " 




tfll - gl4 




Figure 4 A synthetic transcriptional regulatory network, in 
which the solid lines represent transcriptional regulations and 
the dotted lines represent TF-DNA bindings only (meaning 
binding but not regulation) 



A comparison between all the three algorithms (see 
Figure 5) indicates that the performance of mTRIM is 
constantly the best on precision, specificity and sensitivity 
(equivalent to recall). Precisely, the precision of mTRIM 
is 87.5%, while the precisions of DREM and TRIM are 
62.5% and 66.67% respectively. The recall of mTRIM is 
significantly higher than TRIM because it identified 4 out 
of 5 regulatory interactions with more than two TFs, 
while TRIM, because of the scalability issue, cannot find 
any regulatory interactions with more than two TFs. It 
also shows that mTRIM is less sensitive to the change of 
the noise rates from 10% to 40% in the gene expression 
data than the other two algorithms. 

Case studies 

In Figure 6a, a 2-TF regulatory interaction that controls 
12 target genes were found in the Harbison dataset. The 
yellow colored nodes are TFs and the green colored 
nodes are their target genes. The red boxes of the 
dotted lines represent the time points when the TFs col- 
laborate with each other to regulate the target genes. 
STE12 (which is activated by a MAP kinase signaling 
cascade) activates genes involved in mating or pseudo- 
hyphal/invasive growth pathways. DIG1 is the MAP 
kinase-responsive inhibitor of STE12. The target genes 
are enriched in "response to pheromone" (6 genes), 
"growth" (3 genes) and so on. The collaboration 
between STE12 and DIG1 on cell growth was captured 
by mTRIM successfully. Another interesting result 
found in the same dataset is a 6-TF regulatory interac- 
tion (Figure 6b). All the six TFs are well-characterized 
in yeast but are considered to function in different path- 
ways. Our finding connects the distinct biological pro- 
cesses, revealing potential TF collaborations at the 
transcription level. 
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Conclusion 

Revealing the mechanisms of the transcriptional regula- 
tory programs in TRNs is essential for understanding 
the complex control by which genes are expressed in 
living cells. The inference of collaborative protein-DNA 
functions helps paving the critical path for new drug 
development. In this work, we identify the regulatory 
interactions between TFs and target genes with mTRIM, 
an integration of an EM-based Bayesian inference and a 
new association rule mining approach built on a set of 
basic constraints that relate gene expression patterns to 
regulatory interactions. mTRIM is not limited by the 
number of TFs. The experimental results show that 
mTRIM is clearly better than the existing algorithms. 
Since it is difficult to obtain the ground truth for algo- 
rithm performance evaluation on real data, we generated 
two sets of synthetic data and used them to validate the 



results of our algorithm. In our future work, we will use 
third-party biological evidences including multiple TF 
knockouts, metabolic pathways, protein-protein interac- 
tions, etc., for biological validation. In our future work, 
we would like to extend this work by including extra 
data in addition to wild-type gene expression datasets. 
For example, since miRNA can degrade the genes 
induced by certain TFs [32], we will consider miRNA- 
target bindings and miRNA expressions, aiming to 
understand how miRNAs and TFs collaborate to regu- 
late target gene expressions. 

Additional material 



Additional file 1: Supplementary Materials for Awad et al. Figures 
SI, S2, Table SI, and S2. This file contains Figures SI, S2, Tables SI and 

S2. 
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